only use nightly pytorch in ci #243

Merged: 5 commits merged into pytorch:main on Aug 1, 2025

Conversation

@tushar00jain (Contributor) commented on Jul 26, 2025

Summary:

  • change CI to only use nightly PyTorch since block_current_stream is not in stable yet (see the CI sketch below)
  • fix new errors in the nightly version of Pyre
    • remove fixme[29] about a future not being a function
    • make reduce_scatter_quantized return a Work object

Stack created with Sapling. Best reviewed with ReviewStack.
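
For reference, pinning a nightly PyTorch build in CI usually comes down to a pip step like the one below. This is a hedged sketch; the index URL variant (CPU vs. CUDA) and the surrounding workflow step are assumptions rather than this repo's actual configuration.

```sh
# Hypothetical CI step (not necessarily this repo's workflow): install the
# pre-release (nightly) PyTorch wheel from the public nightly index.
# CUDA builds use a different suffix (e.g. .../whl/nightly/cu121).
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
```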

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 26, 2025
@tushar00jain force-pushed the pr243 branch 5 times, most recently from 538ea8a to dfe09bf on Jul 26, 2025 01:12
Summary:
use HTTP transport instead of PG transport -- the PG transport fails to resolve the address when running locally
```diff
@@ -382,7 +382,7 @@ def allreduce(self, tensor: torch.Tensor, should_quantize: bool = False) -> Work
             )
         else:
             work = self._pg.allreduce([tensor], ReduceOp.SUM)
-            work.wait()
+            work.block_current_stream()
```
Member

This partially solves it, but it doesn't really help the case below with the tensor division.

Ideally we wrap the future below in a Work object and then call .block_current_stream() on that.

@tushar00jain (Contributor, author) replied:

> This partially solves it, but it doesn't really help the case below with the tensor division.

In which case?

  • for NCCL with CUDA, the behavior should be the same as the existing one
  • for Gloo with CUDA, the tensor is on the GPU (after the host-to-device copy, but the tensor arg is also on the GPU), so the callback will also return immediately? IIUC, the fut.wait() in the callback that I added will also return immediately
  • for Gloo without CUDA, based on what you said, the callback will be called after the device-to-host copy has completed?
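
For context, the callback pattern under discussion looks roughly like the following. This is a hedged sketch rather than the PR's exact code; pg, tensor, and world_size are placeholder names, and pg is assumed to follow the same allreduce([tensor], ReduceOp.SUM) call shape as the diff above.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch: chain a continuation on the collective's future.
work = pg.allreduce([tensor], dist.ReduceOp.SUM)
fut = work.get_future()

def _divide(fut: torch.futures.Future) -> torch.Tensor:
    # Waiting on the future inside the callback ensures the allreduce result
    # is actually ready before the continuation (the division) touches it.
    fut.wait()
    tensor.div_(world_size)
    return tensor

out_fut = fut.then(_divide)
```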

@tushar00jain (Contributor, author) added:

> Ideally we wrap the future below in a Work object and then call .block_current_stream() on that.

We also need to call work.wait() or work.block_current_stream() to make sure the work finishes on the current stream first, before the future runs on the current stream.
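
To illustrate the idea being discussed, here is a rough sketch of a Work-like wrapper around a future. The _FutureWork class is hypothetical (not this repo's code and not a PyTorch API), and it assumes a CUDA event recorded on the stream that produced the result.

```python
import torch
from torch.futures import Future

class _FutureWork:
    """Hypothetical wrapper: lets a Future be consumed like a Work object, so
    a caller can either block the host or just order the current CUDA stream
    behind the result."""

    def __init__(self, fut: Future, event: torch.cuda.Event) -> None:
        self._fut = fut
        # Assumed to be recorded on the stream where the result was produced.
        self._event = event

    def wait(self) -> bool:
        self._fut.wait()  # blocks the host until the result is ready
        return True

    def block_current_stream(self) -> None:
        # Order the current stream behind the recorded event without blocking
        # the host, mirroring Work.block_current_stream() semantics.
        torch.cuda.current_stream().wait_event(self._event)
```

Either way, as noted above, the underlying collective still has to be ordered on the current stream (via work.wait() or work.block_current_stream()) before the future's continuation runs there.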

@tushar00jain force-pushed the pr243 branch 8 times, most recently from 5507e47 to 61e177c on Jul 29, 2025 04:18
@tushar00jain force-pushed the pr243 branch 7 times, most recently from 4698c70 to d6b54f7 on Jul 29, 2025 18:09
@tushar00jain changed the title from "option 1 - use block_current to overlap compute/communication" to "only use nightly pytorch in ci" on Jul 29, 2025
@tushar00jain force-pushed the pr243 branch 2 times, most recently from cb39d98 to 685e3c4 on Jul 29, 2025 22:32
@d4l3k (Member) left a review:

LGTM

We should document in the README that we require torch nightly.

@tushar00jain force-pushed the pr243 branch 11 times, most recently from d650c7a to 09208a0 on Jul 31, 2025 02:47
Summary:
- call future.wait() in callbacks to make sure the continuation executes after the future has completed
- set the stream correctly to execute the callback scheduled by the bucketized allreduce

Summary:
returns the Work object so we can be more flexible with the usage

Summary:
- change CI to only use nightly PyTorch since block_current_stream is not in stable yet
- fix new errors in the nightly version of Pyre
  - remove fixme[29] about a future not being a function
  - make reduce_scatter_quantized return a Work object
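
As a usage sketch of why returning the Work object adds flexibility, a call site could look roughly like this; the argument names are placeholders, not reduce_scatter_quantized's actual signature.

```python
# Hypothetical call site; argument names are placeholders.
work = reduce_scatter_quantized(output_tensor, input_tensors, process_group)

# ... overlap unrelated compute here ...

# The caller picks the synchronization: block the host, or (on nightly
# PyTorch) order the current CUDA stream behind the collective instead.
work.wait()
# work.block_current_stream()
```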
@tushar00jain merged commit b746582 into pytorch:main on Aug 1, 2025
8 of 10 checks passed
@tushar00jain deleted the pr243 branch on August 1, 2025 06:30